overview

  • introduction to PPLs
  • modelling workflow:
    • prior checks
    • sampling diagnostics
    • model evaluation
    • experimental design
  • …if we have time:
    • measurement error
    • mixture models
    • hierarchical models

from BUGS/JAGS to Stan

  • Gibbs sampling:
    • draws from one parameter at a time, conditional on all the others
    • helpful when Metropolis-Hastings rejects too many joint proposals
  • Hamiltonian Monte Carlo:
    • uses gradient of log-posterior and Hamiltonian dynamics analogy
    • samples from all continuous parameters jointly and efficiently
    • extended to No-U-Turn Sampler (NUTS) in Stan, which tunes hyperparameters during warmup!
  • link to a fun demo
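To make the Metropolis-Hastings rejection behaviour concrete, here is a toy random-walk Metropolis sampler in Python. This is an illustrative sketch only (the target, step size, and seed are invented for the demo; it is not how Stan or JAGS implement sampling):

```python
import math
import random

random.seed(1)

def metropolis(logpdf, n_draws, step=0.5, init=0.0):
    """Random-walk Metropolis: propose a Gaussian jump, accept or reject."""
    q, samples, accepted = init, [], 0
    for _ in range(n_draws):
        proposal = q + random.gauss(0.0, step)
        # accept with probability min(1, p(proposal) / p(current))
        if math.log(random.random()) < logpdf(proposal) - logpdf(q):
            q, accepted = proposal, accepted + 1
        samples.append(q)
    return samples, accepted / n_draws

# standard normal target (log-density up to an additive constant)
logpdf = lambda x: -0.5 * x * x

samples, accept_rate = metropolis(logpdf, n_draws=5_000)
```

With a poorly tuned `step`, the acceptance rate collapses and the chain barely moves; Gibbs sidesteps this by sampling each full conditional directly, while HMC/NUTS instead uses gradients to propose distant points that are still likely to be accepted.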

from BUGS/JAGS to Stan

Hoffman, M.D. and Gelman, A., ‘The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo’, JMLR, 2014

from BUGS/JAGS to Stan

StanCon anyone?

PPLs

‘democratising scalable UQ’ …once installed 🤔

Code
library(cmdstanr)

cmdstanr::cmdstan_version()
[1] "2.34.1"
Code
import cmdstanpy

cmdstanpy.cmdstan_version()
(2, 34)
Code
using Turing, Pkg

Pkg.status("Turing")
Status `~/.julia/environments/v1.11/Project.toml`
⌃ [fce5fe82] Turing v0.35.4
Info Packages marked with ⌃ have new versions available and may be upgradable.

corrosion rates

looking for evidence of active corrosion growth

corrosion rates

loading data

Code
library(tidyverse)

corrosion_data <- read_csv("../../data/corrosion.csv")
head(corrosion_data, 3)
# A tibble: 3 × 5
  anomaly_id soil_type inspection measured_depth_mm     T
       <dbl> <chr>     <chr>                  <dbl> <dbl>
1          1 A         i_0                   0.0672     0
2          2 A         i_0                   0.104      0
3          3 A         i_0                   0.101      0
Code
import polars as pl

corrosion_data = pl.read_csv("../../data/corrosion.csv")
corrosion_data.head(3)
shape: (3, 5)
anomaly_id soil_type inspection measured_depth_mm T
i64 str str f64 i64
1 "A" "i_0" 0.067209 0
2 "A" "i_0" 0.104202 0
3 "A" "i_0" 0.100588 0
Code
using CSV, DataFrames

corrosion_data = CSV.read("../../data/corrosion.csv", DataFrame)
first(corrosion_data, 3)
3×5 DataFrame
 Row │ anomaly_id  soil_type  inspection  measured_depth_mm  T
     │ Int64       String1    String3     Float64            Int64
─────┼─────────────────────────────────────────────────────────────
   1 │          1  A          i_0                 0.0672091      0
   2 │          2  A          i_0                 0.104202       0
   3 │          3  A          i_0                 0.100588       0

a corrosion growth rate model

\[\begin{aligned} & \Delta C = \frac{C_{j} - C_{i}}{\Delta t_{i \rightarrow j}} \\ \\ & \Delta C \sim N(\mu, \sigma^2) \end{aligned}\]
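As a quick worked example of the growth-rate calculation (depths and times invented for illustration; the real values come from `corrosion.csv`):

```python
# hypothetical measurements for one anomaly at two inspections
C_i, C_j = 0.10, 0.16   # measured depths (mm)
t_i, t_j = 0.0, 3.0     # inspection times (years)

# growth rate: change in depth per unit time
delta_C = (C_j - C_i) / (t_j - t_i)   # 0.02 mm/year
```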
Code
cgr_model <- cmdstan_model(stan_file = "corrosion_growth.stan")

cgr_model$format()
data {
  int<lower=0> n_anomalies; // number of anomalies
  int<lower=0> n_inspections; // number of inspections
  vector[n_anomalies * n_inspections] cgr; // growth rate observations (2 per anomaly)
}
parameters {
  real<lower=0> mu; // mean growth rate
  real<lower=0> sigma; // standard deviation
}
model {
  // model
  for (i in 1 : (n_anomalies * n_inspections)) {
    cgr[i] ~ normal(mu, sigma);
  }
  /*
  //alternative (vectorised) implementation:
  delta_C ~ normal(mu, sigma);
  
  //some suggested priors
  mu ~ normal(1/4, 3);
  sigma ~ exponential(1);
  */
}
generated quantities {
  vector[n_anomalies * n_inspections] cgr_pred;
  //vector[n_anomalies * n_inspections] log_lik;
  
  for (i in 1 : (n_anomalies * n_inspections)) {
    cgr_pred[i] = normal_rng(mu, sigma);
    //log_lik[i] = normal_lpdf(delta_C[i] | mu, sigma);
  }
}
Code
cgr_model = cmdstanpy.CmdStanModel(stan_file="corrosion_growth.stan")
INFO:cmdstanpy:found newer exe file, not recompiling
Code
stan_code = cgr_model.code()

from pygments import highlight
from pygments.lexers import StanLexer
from pygments.formatters import NullFormatter

formatted_stan_code = highlight(stan_code, StanLexer(), NullFormatter())

print(formatted_stan_code)
data {
  int <lower = 0> n_anomalies;  // number of anomalies
  int <lower = 0> n_inspections;  // number of inspections
  vector[n_anomalies * n_inspections] cgr;  // growth rate observations (2 per anomaly)
}

parameters {
  real<lower = 0> mu;  // mean growth rate
  real<lower = 0> sigma;  // standard deviation
}

model {  
  // model
  for(i in 1:(n_anomalies * n_inspections)){
    cgr[i] ~ normal(mu, sigma);
  }
  /*
  //alternative (vectorised) implementation:
  delta_C ~ normal(mu, sigma);

  //some suggested priors
  mu ~ normal(1/4, 3);
  sigma ~ exponential(1);
  */
  
}

generated quantities {
  vector[n_anomalies * n_inspections] cgr_pred;
  //vector[n_anomalies * n_inspections] log_lik;

  for (i in 1:(n_anomalies * n_inspections)) {
    cgr_pred[i] = normal_rng(mu, sigma);
    //log_lik[i] = normal_lpdf(delta_C[i] | mu, sigma);
  }
}
Code
@model function corrosion_growth(cgr)
    # priors
    μ ~ Normal(0, 2) |> d -> truncated(d, lower = 0)
    σ ~ Exponential(1)
    
    # model
    for i in eachindex(cgr)
        cgr[i] ~ Normal(μ, σ) |> d -> truncated(d, lower = 0)
    end

    # Turing automatically keeps track of log-likelihoods 🏆 
    
end

running the model

CmdStanR needs its input data as a list

Code
prepare_data <- function(df = corrosion_data) {
  df |>
    arrange(anomaly_id, T) |>
    group_by(anomaly_id) |>
    mutate(
      next_depth = lead(measured_depth_mm),
      time_diff = lead(T) - T
    ) |>
    filter(!is.na(next_depth)) |>
    mutate(
      delta_C = (next_depth - measured_depth_mm) / time_diff
    ) |>
    select(anomaly_id, delta_C, soil_type) |>
    ungroup()
}

model_data <- list(
  n_anomalies = prepare_data()$anomaly_id |> unique() |> length(),
  n_inspections = 2,
  cgr = prepare_data()$delta_C
)

cgr_post <- cgr_model$sample(data = model_data)
Running MCMC with 4 sequential chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.1 seconds.
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 2 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 2 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 2 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 2 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 2 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 2 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 0.1 seconds.
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 3 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 3 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 3 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 3 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 3 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 0.1 seconds.
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 4 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 4 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 4 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 4 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 4 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 4 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 4 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 4 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 4 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 4 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 4 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 4 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 0.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.8 seconds.

CmdStanPy needs its input data as a dictionary

Code
def prepare_data(df = corrosion_data):
    return (
        df.sort(['anomaly_id', 'T'])
        .group_by('anomaly_id')
        .agg([
            pl.col('measured_depth_mm').shift(-1).alias('next_depth'),
            pl.col('T').shift(-1).alias('next_time'),
            pl.col('measured_depth_mm'),
            pl.col('T')
        ])
        .filter(pl.col('next_depth').is_not_null())
        .with_columns([
            ((pl.col('next_depth') - pl.col('measured_depth_mm')) / 
             (pl.col('next_time') - pl.col('T'))).alias('delta_C')
        ])
        .select(['anomaly_id', 'delta_C'])
        .explode('delta_C')  # unnest the list columns produced by agg
        .filter(pl.col('delta_C').is_not_null())  # drop any remaining nulls
    )

model_data = {
        'n_anomalies': prepare_data().select('anomaly_id').unique().height,
        'n_inspections': 2,
        'cgr': prepare_data().select('delta_C').to_series().to_numpy()
    }

cgr_post = cgr_model.sample(data = model_data)

INFO:cmdstanpy:CmdStan start processing

chain 1 |          | 00:00 Status
chain 2 |          | 00:00 Status
chain 3 |          | 00:00 Status
chain 4 |          | 00:00 Status
chain 1 |#######2  | 00:00 Iteration: 1300 / 2000 [ 65%]  (Sampling)
chain 3 |######8   | 00:00 Iteration: 1200 / 2000 [ 60%]  (Sampling)
chain 2 |######8   | 00:00 Iteration: 1200 / 2000 [ 60%]  (Sampling)
chain 4 |######8   | 00:00 Iteration: 1200 / 2000 [ 60%]  (Sampling)
chain 1 |##########| 00:00 Sampling completed
chain 2 |##########| 00:00 Sampling completed
chain 3 |##########| 00:00 Sampling completed
chain 4 |##########| 00:00 Sampling completed
INFO:cmdstanpy:CmdStan done processing.

A Turing model needs its input data as arguments to the model function

Code
function prepare_data(df::DataFrame = corrosion_data)
    sorted_df = sort(df, [:anomaly_id, :T]); result = DataFrame()
    
    for group in groupby(sorted_df, :anomaly_id)
        if nrow(group) > 1
            for i in 1:(nrow(group)-1)
                Δc = (group[i+1, :measured_depth_mm] - group[i, :measured_depth_mm]) / (group[i+1, :T] - group[i, :T])
                push!(result, (
                    anomaly_id = group[i, :anomaly_id], Δc = max(0, Δc)
                ))
            end
        end
    end
    
    return result
end

cgr_post = prepare_data().Δc |> 
  data -> corrosion_growth(data) |>
  model -> sample(model, NUTS(), 1_000)

taking a look

up and running with PPLs

success

next up: how we can extend this to a robust and helpful workflow

break?

comparisons in Turing.jl

Code
n_draws = 1_000; n_chains = 4

# a No-U-Turn Sampler, with 2000 warmup adaptation steps and a target acceptance rate of 0.65
NUTS_sampler = NUTS(2_000, 0.65)

# a Hamiltonian Monte Carlo sampler, with a step size of 0.05 and 10 leapfrog steps
HMC_sampler = HMC(0.05, 10)

# a Metropolis-Hastings sampler, using the default proposal distribution (priors)
MH_sampler = MH()

# a 'compositional' Gibbs sampler (Metropolis within Gibbs) - sampling μ with MH and σ with NUTS
Gibbs_sampler = Gibbs(MH(:μ), NUTS(2_000, 0.65, :σ))

run_mcmc = function(sampler)
    return prepare_data().Δc |> 
      data -> corrosion_growth(data) |>
      model -> sample(model, sampler, MCMCThreads(), n_draws, n_chains)
end
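The `HMC(0.05, 10)` arguments above (step size and number of leapfrog steps) map directly onto the leapfrog integrator that drives Hamiltonian proposals. A minimal Python sketch, for illustration only, on a standard normal target (so the potential is \(U(q) = q^2/2\)):

```python
def leapfrog(q, p, eps, n_steps, grad_U):
    """Simulate Hamiltonian dynamics with the leapfrog integrator.

    q: position (parameter value), p: momentum,
    eps: step size, grad_U: gradient of the negative log-density.
    """
    p -= 0.5 * eps * grad_U(q)      # initial half step for momentum
    for _ in range(n_steps - 1):
        q += eps * p                # full step for position
        p -= eps * grad_U(q)        # full step for momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)      # final half step for momentum
    return q, p

# standard normal target: U(q) = q^2 / 2, so grad U(q) = q
grad_U = lambda q: q
H = lambda q, p: 0.5 * q**2 + 0.5 * p**2  # Hamiltonian: potential + kinetic

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, eps=0.05, n_steps=10, grad_U=grad_U)

# leapfrog approximately conserves the Hamiltonian, which is why
# HMC proposals are accepted with high probability
drift = abs(H(q1, p1) - H(q0, p0))
```

NUTS removes the need to hand-tune `eps` and `n_steps`: it adapts the step size during warmup and chooses the trajectory length on the fly.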